mpg data

ggplot2 dataset, you need to first load tidyverse or ggplot2

Fuel economy data from 1999 and 2008 for 38 popular models of car

library(tidyverse)
?mpg
head(mpg,3)
## # A tibble: 3 x 11
##   manufacturer model displ  year   cyl trans drv     cty   hwy fl    class
##   <chr>        <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi         a4      1.8  1999     4 auto… f        18    29 p     comp…
## 2 audi         a4      1.8  1999     4 manu… f        21    29 p     comp…
## 3 audi         a4      2    2008     4 manu… f        20    31 p     comp…
mpg$class[1:20]
##  [1] "compact" "compact" "compact" "compact" "compact" "compact" "compact"
##  [8] "compact" "compact" "compact" "compact" "compact" "compact" "compact"
## [15] "compact" "midsize" "midsize" "midsize" "suv"     "suv"

Data Visualization with ggplot: Barplots

geom_bar is designed to make it easy to create bar charts that show counts

g <- ggplot(mpg, aes(class))
# Number of cars in each class:
g + geom_bar()

Stacked

Bar charts are automatically stacked when multiple bars are “defined”" at the same location.
description of drv: f = front-wheel drive, r = rear wheel drive, 4 = 4wd

g + geom_bar(aes(fill = drv))

Side by side

To have side by side plots for the groups you would like to compare, you need to use the position argument within the geom_ functions. You can either simply assign the name of positioning as a string or assign a position_ function for more details.

g + geom_bar(aes(fill = drv), position = "dodge")

g + geom_bar(aes(fill = drv),position = position_dodge(width = 0.4))

Visualizing data points or mean as barplot

Unfortunately ggplot2 will not calculate means of a variable for you.
If you would like to plot means or any other values other than the count of categories you need to provide them to ggplot() in the following format:

Assuming that outcome is the mean outcome of treatments (trt) a, b and c here.

df <- data.frame(trt = c("a", "b", "c"), outcome = c(2.3, 1.9, 3.2))
df
##   trt outcome
## 1   a     2.3
## 2   b     1.9
## 3   c     3.2

Then use geom_col()

ggplot(df, aes(trt, outcome)) +
  geom_col()

You can also change colors easily

ggplot(df, aes(trt, outcome)) +
  geom_col(fill="steelblue")

ggplot(df, aes(trt, outcome)) +
  geom_col(aes(fill=trt))

geom_bar() with continuous data

You can also use geom_bar() with continuous data, in which case it will show counts at unique locations.

ggplot(iris, aes(Sepal.Length)) + geom_bar()

Data Visualization with ggplot: Histograms

Histograms are basically barplots with “bins”.
To construct a histogram,

  1. “bin” the range of values. What is meant by “bining” is, to divide the entire range of values into a series of intervals.
  2. Then count how many values fall into each interval.

The bins are usually specified as consecutive, non-overlapping intervals of a variable.

ggplot(iris, aes(Sepal.Length)) + geom_histogram(binwidth = 0.2)

ggplot(iris, aes(Sepal.Length, fill=Species)) + 
  geom_histogram(binwidth = 0.5)

ggplot(iris, aes(Sepal.Length, fill=Species)) + 
  geom_histogram(binwidth = 0.5, alpha=0.4)

ggplot(iris, aes(Sepal.Length, fill=Species)) + 
  geom_histogram(binwidth = 0.5, position = "identity", alpha=0.4)

Data Visualization with ggplot: Boxplots

A box plot is a easy way to get some overall idea about the distribution of your variable. Let’s have a quick review on how to read a box plot:

Boxplot (with an interquartile range) and a probability density function (pdf) of a Normal N(0,σ2) Population

Boxplot (with an interquartile range) and a probability density function (pdf) of a Normal N(0,σ2) Population

Example ToothGrowth dataset

head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
?ToothGrowth

Here’s how the the distribution of tooth length for different regimens look like.

gb <- ToothGrowth %>% 
      unite(regimen, -len, sep= "_") %>% 
      ggplot(data=., aes(regimen, len)) +
      geom_boxplot(aes(color=regimen))

Are you confused with this code?
We could get the exact same result with the following:

ToothGrowth$regimen <- paste(ToothGrowth$supp, ToothGrowth$dose, sep="_")

gb <- ggplot(data=ToothGrowth, aes(regimen, len)) +
      geom_boxplot(aes(color=regimen))
gb

So you see, these dplyr functions just make life a little easier, and potentially your code looks cleaner and easier to read. We’ll talk about that later.

Bonus: geom_dotplot

What’s a dotplot?

In a dot plot, the width of a dot corresponds to the bin width (or maximum width, depending on the binning algorithm), and dots are stacked, with each dot representing one observation.

Example:

set.seed(2211) # get the same random variables from the normal distribution
df <- data.frame(x= paste0("time",sort(rep(1:4, 25))), 
                 y= c(rnorm(25, 0, 2), 
                      rnorm(25, 1, 2), 
                      rnorm(25, 2, 2), 
                      rnorm(25, 3, 2)))
head(df)
##       x          y
## 1 time1  0.4339022
## 2 time1 -1.7532299
## 3 time1  3.1492460
## 4 time1  0.5317880
## 5 time1  2.2812159
## 6 time1  0.6126187
# look at distribution of y
ggplot(data=df, aes(y)) + geom_histogram(binwidth = 1)

# look at distribution of x
ggplot(data=df, aes(x)) + geom_bar()

# now lets look with dotplot
ggplot(data=df, aes(x, y)) +
  geom_dotplot(binaxis="y", 
               binwidth = 0.3, 
               stackdir = "center")

Your turn

Change the binwidth to 0.1, 0.5, 1 one by one. Then change the stackdir to “up” (default), “down”, “center”, “centerwhole”.

What’s happening?

Let’s go back to ToothGrowth data and apply geom_dotplot

gb + geom_dotplot(binwidth = 0.5,
                  aes(fill = regimen),
                  alpha = 0.3,
                  binaxis = "y", 
                  stackdir = "center")

ggplot2: Plotting means and error bars

The error bars can define multiple types of variation:

  1. Standard deviation (sd) shows how much each data point deviate from the mean “on average”. This is useful if you would like to have a general idea about how the data is distributed.

  2. Standard error of the mean (sem) gives a value explaining how much your estimated mean is affected by the sampling variability. It can be difficult to interpret.

  3. 95% Confidence interval (ci95) the interval which you are 95% certain that the true mean falls into. This is probably the best type of variation if we are really interested in estimating the mean.

As mentioned earlier, ggplot2 requires calculated summary statistics to be given to the ggplot() in a data frame form. This is why I keep mentioning dplyr package. It provides nice tools to manipulate data and get some summary statistics out of it.

So, now I will to compute the mean, sd, sem and CI95. To compute sem and CI95 I need to write my own functions…

Writing custom functions in R

Now I would like to write my own custom functions to compute some of the statistics.
For that I will use the function() function. For example, I would like to define a function called add5mult3mean that

  1. Takes a numeric vector
  2. Adds 5 to all of its elements
  3. Multiplies with 3
  4. And finally gives the mean of this modified vector
add5mult3mean <- function(x) mean((x+5)*3)

myVec <- 1:10
add5mult3mean(myVec)
## [1] 31.5
# or
mean((myVec+5)*3)
## [1] 31.5

You can name the argument whatever you want as long as it’s consistent throughout your function.
So the following would again give me the same:

add5mult3mean <- function(a_numeric_vector) mean((a_numeric_vector+5)*3)
add5mult3mean(myVec)
## [1] 31.5

You can add as many arguments as you want and use objects in your environment.

irisChopper <- function(breaks){
  chopped.iris <- data.frame(cat.Sep.Len = cut(iris$Sepal.Length, breaks = breaks),
             cat.Sep.Wid = cut(iris$Sepal.Width, breaks = breaks),
             cat.Pet.Len = cut(iris$Petal.Length, breaks = breaks),
             cat.Pet.Wid = cut(iris$Petal.Width, breaks = breaks),
             iris$Species)
  head(chopped.iris)
}

irisChopper(10)
##   cat.Sep.Len cat.Sep.Wid  cat.Pet.Len   cat.Pet.Wid iris.Species
## 1 (5.02,5.38] (3.44,3.68] (0.994,1.59] (0.0976,0.34]       setosa
## 2 (4.66,5.02]  (2.96,3.2] (0.994,1.59] (0.0976,0.34]       setosa
## 3 (4.66,5.02]  (2.96,3.2] (0.994,1.59] (0.0976,0.34]       setosa
## 4  (4.3,4.66]  (2.96,3.2] (0.994,1.59] (0.0976,0.34]       setosa
## 5 (4.66,5.02] (3.44,3.68] (0.994,1.59] (0.0976,0.34]       setosa
## 6 (5.38,5.74] (3.68,3.92]  (1.59,2.18]   (0.34,0.58]       setosa
irisChopper(0:10)
##   cat.Sep.Len cat.Sep.Wid cat.Pet.Len cat.Pet.Wid iris.Species
## 1       (5,6]       (3,4]       (1,2]       (0,1]       setosa
## 2       (4,5]       (2,3]       (1,2]       (0,1]       setosa
## 3       (4,5]       (3,4]       (1,2]       (0,1]       setosa
## 4       (4,5]       (3,4]       (1,2]       (0,1]       setosa
## 5       (4,5]       (3,4]       (1,2]       (0,1]       setosa
## 6       (5,6]       (3,4]       (1,2]       (0,1]       setosa

Now lets get back to our functions to calculate some statistics. Function sem() computes the standard error of the mean for a numeric vector. Similarly, ci95 function computes the mean confidence interval of the numeric vector given.

# here I am defining a function 
# that would compute the standard error of the mean
sem <- function(x) sd(x)/sqrt(length(x))

# here I am defining a function 
# that would compute the CI95.
# am computing this with t-distribution 
# since I don't know the population standard deviation and
# my sample size for each treatment group is less than 30 --- a general rule
# qt() function is the t-distribution quantile function. 
# It basically give the t-statistic value for the quantile given
ci95 <- function(x)  qt(0.975,df=length(x)-1)*sd(x)/sqrt(length(x))

If you run these commands, you’ll see that you have defined functions in your environment.

Now let’s apply these to our data using dplyr functions.

Compute stats for ToothGrowth

tg.stats <- ToothGrowth %>% 
  group_by(supp, dose) %>% # group by 
  summarise(., 
            mean.len = mean(len),  # mean
            sd.len = sd(len),      # sd
            se.len = sem(len),     # sem
            ci95.len = ci95(len),  # ci95
            n = length(len)) %>%   # count
 right_join(ToothGrowth)

Here I am using the pipe %>% and functions group_by(), summarise(), right_join() from the dplyr package. Let’s not get into details now, but very briefly this is what I am doing:

  1. ToothGrowth %>% I am piping the ToothGrowth data in the function that comes up next. This way I can keep piping things to other functions.
    • x %>% f(y) is the same as f(x, y)
    • y %>% f(x, ., z) is the same as f(x, y, z )
  2. group_by(supp, dose): Grouping it based on its supp and dose columns.

  3. summarize(...): Computing the statistics for every len observation in the respective supp and dose group. Statistics I compute are:
    -sd
    -sem
    -ci95
    length (this isn’t really a statistic is just the count of the observation in each group)

  4. right_join(ToothGrowth)I am appending the summary statistics to the original data

Now this is how tg.stats look

tg.stats
## # A tibble: 60 x 8
## # Groups:   supp [?]
##    supp   dose mean.len sd.len se.len ci95.len     n   len
##    <fct> <dbl>    <dbl>  <dbl>  <dbl>    <dbl> <int> <dbl>
##  1 VC      0.5     7.98   2.75  0.869     1.96    10   4.2
##  2 VC      0.5     7.98   2.75  0.869     1.96    10  11.5
##  3 VC      0.5     7.98   2.75  0.869     1.96    10   7.3
##  4 VC      0.5     7.98   2.75  0.869     1.96    10   5.8
##  5 VC      0.5     7.98   2.75  0.869     1.96    10   6.4
##  6 VC      0.5     7.98   2.75  0.869     1.96    10  10  
##  7 VC      0.5     7.98   2.75  0.869     1.96    10  11.2
##  8 VC      0.5     7.98   2.75  0.869     1.96    10  11.2
##  9 VC      0.5     7.98   2.75  0.869     1.96    10   5.2
## 10 VC      0.5     7.98   2.75  0.869     1.96    10   7  
## # ... with 50 more rows

Finally we can start plotting! Let’s start with line graphs.

Point and Line graphs

Step by step:

1. Just plot mean tooth length as data points

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) + 
    geom_point(aes(x=dose, y= mean.len), size=4, shape=18) 

Oh you’re wondering about the shapes? We’ll come to that…

2. Add the error bars: Show confidence intervals

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) + 
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1)

3. Add lines

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) + 
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
  geom_line() 

4. Add original data points

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) + 
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
  geom_line() + 
  geom_point(aes(x=dose, y=len, fill=supp), alpha=0.4, size=1.5)

Now that we have more or less the plot we want, let’s customize it further.

Customizing graphs with ggplot2

1. Change titles and set scale limits

Modify axis, legend, and plot labels see Reference ggplot2: Scale

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) + 
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
  geom_line() + 
  geom_point(aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +

# Here I add the titles I want    
  labs(x="Supplement Dose",
       y="Tooth Length",
       title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
       subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
       caption = "Based on the data from R")

Define the limits of axes

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) + 
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
  geom_line() + 
  geom_point(aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +

# Here I add the titles I want    
  labs(x="Supplement Dose",
       y="Tooth Length",
       title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
       subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
       caption = "Based on the data from R") +

# Change x and y axes limits
  xlim(0, 3) +
  ylim(0, 60)

2. Change legend title

We saw in the previous session how to change legend parameters with guides() function. To change the legend title we can use the same function. See (Reference ggplot2 Guides)[http://ggplot2.tidyverse.org/reference/index.html#section-guides-axes-and-legends]

ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) + 
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18) +
  geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1) +
  geom_line() + 
  geom_point(aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5) +

# Here I add the titles I want    
  labs(x="Supplement Dose",
       y="Tooth Length",
       title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
       subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
       caption = "Based on the data from R") +
  
# Change the legend
  guides(colour=guide_legend("Supplement"), fill=guide_legend("Supplement"))

3. Jitter points

I want to customize the jitter of the points. To adjust positions look at position adjustment in ggplot2 Reference

pd <- position_dodge(0.1)

# my plot is getting very long, I'll assign it
tgp <- ggplot(tg.stats, aes(x=dose, y=mean.len, colour=supp)) + 
  geom_point(aes(x=dose, y= mean.len), size=4, shape=18, position=pd) +
  geom_errorbar(aes(ymin=mean.len-ci95.len, ymax=mean.len+ci95.len), width=.1, position=pd) +
  geom_line(position=pd) + 
  geom_point(aes(x=dose, y=len, fill=supp), alpha=0.3, size=1.5, position=pd) +


# Here I add the titles I want    
  labs(x="Supplement Dose",
       y="Tooth Length",
       title= "The Effect of Vitamin C on Tooth Growth in Guinea Pigs",
       subtitle= "Comparing delivery methods: Orange Juice vs Ascorbic Acid",
       caption = "Based on the data from R") +
  
# Change the legend
  guides(colour=guide_legend("Supplement"), fill=guide_legend("Supplement"))

tgp

4. Themes

Theme allows to make cosmetic changes

tgp + theme_bw()

tgp + theme_minimal()

5. Shapes